Abstract

Prior knowledge has been shown to be very useful for many natural
language processing tasks. Many approaches have been proposed to
formalise a variety of knowledge; however, whether a proposed approach
is robust or sensitive to the knowledge supplied to the model has
rarely been discussed. In this paper, we propose three regularization
terms on top of generalized expectation criteria, and conduct extensive
experiments to justify the robustness of the proposed methods.
Experimental results demonstrate that our proposed methods obtain
remarkable improvements and are much more robust than baselines.


1 Introduction 


We possess a wealth of prior knowledge about many natural language
processing tasks. For example, in text categorization, we know that
words such as NBA, player, and basketball are strong indicators of the
sports category (Druck et al., 2008), and in sentiment classification,
words like terrible, boring, and messing indicate a negative polarity
while words like perfect, exciting, and moving suggest a positive
polarity.

A key problem arising here is how to leverage such knowledge to guide
the learning process, an interesting problem for both the NLP and
machine learning communities. Previous studies addressing the problem
fall into several lines. First, prior knowledge is leveraged to label
data (Haghighi and Klein, 2006; Raghavan and Allan, 2007). Second,
prior knowledge is encoded as a prior on parameters, which is common in
many Bayesian approaches (Andrzejewski and Zhu, 2009; Andrzejewski et
al., 2011). Third, prior knowledge is formalised with additional
variables and dependencies (Li et al., 2010). Last, prior knowledge is
used to control the distributions over latent output variables (Graça
et al., 2007; McCallum et al., 2007; Chang et al., 2007), which makes
the output variables easily interpretable.


However, a crucial problem, which has rarely been addressed, is the
bias in the prior knowledge that we supply to the learning model. Would
the model be robust or sensitive to the prior knowledge? Or, which kind
of knowledge is appropriate for the task? Consider an example: we may
be a baseball fan but unfamiliar with hockey, so that we can provide a
number of feature words for baseball but far fewer for hockey in a
baseball-hockey classification task. Such prior knowledge may mislead
the model with a heavy bias towards baseball. If the model cannot
handle this situation appropriately, the performance may be
undesirable.

In this paper, we investigate this problem in the framework of
Generalized Expectation Criteria (McCallum et al., 2007). The study
aims to reveal the factors that reduce the model's sensitivity to the
prior knowledge, and thereby to make the model more robust and
practical. To this end, we introduce auxiliary regularization terms in
which our prior knowledge is formalized as a distribution over output
variables. Recall the example just mentioned: though we do not have
enough knowledge to provide features for class hockey, it is easy for
us to provide some neutral words, namely words that are not strong
indicators of any class, like player here. As one of the factors
revealed in this paper, supplying neutral feature words can boost the
performance remarkably, making the model more robust.

More attractively, we do not need manual annotation to label these
neutral feature words in our proposed approach.

More specifically, we explore three regularization terms to address the
problem: (1) a regularization term associated with neutral features;
(2) the maximum entropy of class distribution regularization term; and
(3) the KL divergence between reference and predicted class
distribution. For the first, we simply use the most common features as
neutral features and assume the neutral features are distributed
uniformly over class labels. For the second and third, we assume we
have some knowledge about the class distribution, which will be
detailed later.

To summarize, the main contributions of this 
work are as follows: 

• We explore three regularization terms to make the model more robust:
a regularization term associated with neutral features, the maximum
entropy of class distribution regularization term, and the KL
divergence between reference and predicted class distribution.


• Experiments demonstrate that the proposed approaches outperform
baselines and work much more robustly.

The rest of the paper is structured as follows: In Section 2, we
briefly describe the generalized expectation criteria and present the
proposed regularization terms. In Section 3, we conduct extensive
experiments to justify the proposed methods. We survey related work in
Section 4, and summarize our work in Section 5.


2 Method


We address the robustness problem on top of GE-FL (Druck et al., 2008),
a GE method which leverages labeled features as prior knowledge. A
labeled feature is a strong indicator of a specific class and is
manually provided to the classifier. For example, words like amazing
and exciting can be labeled features for class positive in sentiment
classification.

2.1 Generalized Expectation Criteria

Generalized expectation (GE) criteria (McCallum et al., 2007) provide a
natural way to directly constrain the model in the preferred direction.
For example, when we know the proportion of each class of the dataset
in a classification task, we can guide the model to predict a
pre-specified class distribution.

Formally, in a parameter estimation objective function, a GE term
expresses preferences on the value of some constraint functions about
the model's expectation. Given a constraint function $G(x, y)$, a
conditional model distribution $p_\theta(y|x)$, an empirical
distribution $\tilde{p}(x)$ over input samples and a score function
$S$, a GE term can be expressed as follows:

$$ S\left(E_{\tilde{p}(x)}\left[E_{p_\theta(y|x)}[G(x, y)]\right]\right) \qquad (1) $$

2.2 Learning from Labeled Features

Druck et al. (2008) proposed GE-FL to learn from labeled features using
generalized expectation criteria. Given a set of labeled features $K$,
the reference distribution over classes of these features is denoted by
$\hat{p}(y|x_k), k \in K$. GE-FL introduces the divergence between this
reference distribution and the model predicted distribution
$p_\theta(y|x_k)$ as a term of the objective function:


$$ O = \sum_{k \in K} KL\left(\hat{p}(y|x_k) \,\|\, p_\theta(y|x_k)\right) + \sum_{y,i} \frac{\theta_{yi}^2}{2\sigma^2} \qquad (2) $$

where $\theta_{yi}$ is the model parameter which indicates the
importance of word $i$ to class $y$. The predicted distribution
$p_\theta(y|x_k)$ can be expressed as follows:

$$ p_\theta(y|x_k) = \frac{1}{C_k} \sum_x p_\theta(y|x) I(x_k) $$

in which $I(x_k)$ is 1 if feature $k$ occurs in instance $x$ and 0
otherwise, $C_k = \sum_x I(x_k)$ is the number of instances with a
non-zero value of feature $k$, and $p_\theta(y|x)$ takes a softmax form
as follows:

$$ p_\theta(y|x) = \frac{\exp\left(\sum_i \theta_{yi} x_i\right)}{\sum_{y'} \exp\left(\sum_i \theta_{y'i} x_i\right)}. $$

To solve the optimization problem, L-BFGS can be used for parameter
estimation.

In the framework of GE, this term can be obtained by setting the
constraint function $G(x, y) = \frac{1}{C_k} \vec{I}(y) I(x_k)$, where
$\vec{I}(y)$ is an indicator vector with 1 at the index corresponding
to label $y$ and 0 elsewhere.
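A minimal sketch of how the GE-FL objective of this section could be
evaluated, assuming a small corpus held as plain Python lists of
feature values. The function names, the `references` mapping from
feature index to reference distribution, and the `sigma2` variance of
the parameter penalty are illustrative assumptions, not part of any
published GE-FL implementation; every labeled feature is assumed to
occur in at least one document.

```python
import math

def softmax_probs(theta, x):
    """p_theta(y|x) for one document x (a list of feature values).

    theta[y][i] is the weight of feature i for class y.
    """
    scores = [sum(w * v for w, v in zip(row, x)) for row in theta]
    top = max(scores)                          # subtract the max for numerical stability
    expo = [math.exp(s - top) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

def feature_conditional(theta, docs, k):
    """p_theta(y|x_k): mean of p_theta(y|x) over documents that contain feature k."""
    with_k = [x for x in docs if x[k] > 0]     # documents where I(x_k) = 1
    c_k = len(with_k)                          # C_k, assumed to be non-zero
    sums = [0.0] * len(theta)
    for x in with_k:
        for y, p in enumerate(softmax_probs(theta, x)):
            sums[y] += p
    return [s / c_k for s in sums]

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ge_fl_objective(theta, docs, references, sigma2=1.0):
    """Sum of KL terms over labeled features plus a squared penalty on theta."""
    fit = sum(kl(ref, feature_conditional(theta, docs, k))
              for k, ref in references.items())
    penalty = sum(w * w for row in theta for w in row) / (2.0 * sigma2)
    return fit + penalty
```

The sketch only evaluates the objective; in practice the value and its
gradient would be passed to an L-BFGS optimizer.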

2.3 Regularization Terms 

GE-FL reduces the heavy load of instance annotation and performs well
when we provide prior knowledge with no bias. In our experiments, we
observe that comparable numbers of labeled features for each class have
to be supplied. But as mentioned before, it is often the case that we
are not able to provide enough knowledge for some of the classes. For
the baseball-hockey classification task described earlier, GE-FL will
predict most of the instances as baseball. In this section, we present
three terms to make the model more robust.

2.3.1 Regularization Associated with Neutral Features

Neutral features are features that are not informative indicators of
any class, for instance, the word player in the baseball-hockey
classification task. Such features are usually frequent words across
all categories. When we set the preference distribution of the neutral
features to be uniform, these neutral features prevent the model from
being biased towards the class that has a dominant number of labeled
features.

Formally, given a set of neutral features $K'$, the uniform
distribution is $p_u(y|x_k) = \frac{1}{|C|}, k \in K'$, where $|C|$ is
the number of classes. The objective function with the new term becomes

$$ O_{NE} = O + \sum_{k \in K'} KL\left(p_u(y|x_k) \,\|\, p_\theta(y|x_k)\right). \qquad (3) $$

Note that we do not need manual annotation to provide neutral features.
One simple way is to take the most common features as neutral features.
Experimental results show that this strategy works successfully.
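The neutral-feature strategy can be sketched as follows, under the
assumption that documents are token lists with stop words already
removed and that "most common" is measured by document frequency (the
text does not say whether document or term frequency is meant). The
function names are hypothetical.

```python
import math
from collections import Counter

def most_common_features(docs, n_neutral=10):
    """Return the n_neutral words that occur in the most documents.

    docs: list of token lists, assumed to have stop words removed.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))           # count each word once per document
    return [word for word, _ in doc_freq.most_common(n_neutral)]

def neutral_term(feature_probs):
    """Sum over neutral features of KL(uniform || p_theta(y|x_k)).

    feature_probs: one list per neutral feature, holding p_theta(y|x_k).
    """
    total = 0.0
    for probs in feature_probs:
        u = 1.0 / len(probs)                   # uniform reference 1/|C|
        total += sum(u * math.log(u / p) for p in probs)
    return total
```

The term is zero when the model already predicts a uniform distribution
for every neutral feature and grows as those predictions become skewed.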

2.3.2 Regularization with Maximum Entropy Principle

Another way to prevent the model from drifting from the desired
direction is to constrain the predicted class distribution on unlabeled
data. When lacking knowledge about the class distribution of the data,
one feasible way is to adopt the maximum entropy principle, as below:

$$ O_{ME} = O + \lambda \sum_y p(y) \log p(y) \qquad (4) $$

where $p(y)$ is the predicted class distribution, given by
$p(y) = \frac{1}{|X|} \sum_x p_\theta(y|x)$. Since
$\sum_y p(y) \log p(y)$ is the negative entropy of $p(y)$, minimizing
$O_{ME}$ favors a high-entropy predicted class distribution. To control
the influence of this term on the overall objective function, we can
tune $\lambda$ according to the difference in the number of labeled
features of each class. In this paper, we simply set $\lambda$ to be
proportional to the total number of labeled features, say
$\lambda = \beta |K|$.

This maximum entropy term can be derived by setting the constraint
function to $G(x, y) = \vec{I}(y)$. Therefore,
$E_{p_\theta(y|x)}[G(x, y)]$ is just the model distribution
$p_\theta(y|x)$ and its expectation with the empirical distribution
$\tilde{p}(x)$ is simply the average over input samples, namely $p(y)$.
When $S$ takes the maximum entropy form, we can derive the objective
function as above.
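As a rough illustration of the maximum entropy term, the sketch below
averages per-document predictions into the predicted class distribution
and scales the negative entropy by the weight (beta times the number of
labeled features). The function names are hypothetical, and the default
`beta=5.0` simply mirrors the default weight reported in Section 3.

```python
import math

def predicted_class_distribution(doc_probs):
    """p(y): average of p_theta(y|x) over all input samples."""
    n_docs = len(doc_probs)
    n_classes = len(doc_probs[0])
    return [sum(p[y] for p in doc_probs) / n_docs for y in range(n_classes)]

def max_entropy_term(doc_probs, n_labeled_features, beta=5.0):
    """Weighted negative entropy of p(y), with weight beta * |K|."""
    lam = beta * n_labeled_features
    p_y = predicted_class_distribution(doc_probs)
    # Terms with p(y) = 0 contribute 0 by the usual convention.
    return lam * sum(p * math.log(p) for p in p_y if p > 0)
```

Because the term is added to an objective that is minimized, a uniform
prediction yields the smallest (most negative) value.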

2.3.3 Regularization with KL Divergence 

Sometimes, we already have much knowledge about the corpus, and can
estimate the class distribution roughly without labeling instances.
Therefore, we introduce the KL divergence between the predicted and
reference class distributions into the objective function.

Given the preference class distribution $\hat{p}(y)$, we modify the
objective function as follows:

$$ O_{KL} = O + \lambda KL\left(\hat{p}(y) \,\|\, p(y)\right) \qquad (5) $$

Similarly, we set $\lambda = \beta |K|$.

This divergence term can be derived by setting the constraint function
to $G(x, y) = \vec{I}(y)$ and setting the score function to
$S(\hat{p}, p) = \sum_i \hat{p}_i \log \frac{\hat{p}_i}{p_i}$, where
$\hat{p}$ and $p$ are distributions. Note that this regularization term
involves the reference class distribution, which will be discussed
later.
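A minimal sketch of the class-distribution KL term, assuming the
reference is supplied as a ratio or distribution and the per-document
predictions come from a softmax model (so no predicted probability is
exactly zero). The function name and the `beta=5.0` default, which
mirrors the default weight reported in Section 3, are illustrative
choices.

```python
import math

def kl_class_term(doc_probs, reference, n_labeled_features, beta=5.0):
    """Weighted KL(reference || predicted class distribution), weight beta * |K|."""
    n_docs = len(doc_probs)
    # Predicted class distribution: average of p_theta(y|x) over documents.
    p_y = [sum(p[y] for p in doc_probs) / n_docs for y in range(len(reference))]
    total = float(sum(reference))
    ref = [r / total for r in reference]       # accepts ratios such as [1, 4]
    lam = beta * n_labeled_features
    return lam * sum(r * math.log(r / p) for r, p in zip(ref, p_y) if r > 0)
```

For example, `kl_class_term(doc_probs, [1, 4], n_labeled_features=11)`
would penalize predictions whose average departs from a 1:4 class
ratio.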

3 Experiments 

In this section, we first evaluate the approach when there is imbalance
in the number of labeled features or in the class distribution. Then,
to test the influence of $\lambda$, we conduct experiments with the
method that incorporates the KL divergence of the class distribution.
Last, we evaluate our approaches on 9 commonly used text classification
datasets. We set $\lambda = 5|K|$ by default in all experiments unless
stated otherwise.

The baseline we choose here is GE-FL (Druck et al., 2008), a method
based on generalized expectation criteria.


3.1 Data Preparation

We evaluate our methods on several commonly used datasets whose themes
range from sentiment, webpage, and science to medical and healthcare.
We use bag-of-words features and remove stopwords in the preprocessing
stage. Though we have labels for all documents, we do not use them
during the learning process; instead, we use the labels of features.

The movie dataset, in which the task is to classify movie reviews as
positive or negative, is used for testing the proposed approaches with
unbalanced labeled features, unbalanced datasets, or different
$\lambda$ parameters.¹ All unbalanced datasets are constructed from the
movie dataset by randomly removing documents of the positive class. For
each experiment, we conduct 10-fold cross validation.


Labeled Features


As described in (Druck et al., 2008), there are two ways to obtain
labeled features. The first way is to use information gain. We first
calculate the mutual information of all features according to the
labels of the documents and select the top 20 as labeled features for
each class as a feature pool. Note that using information gain requires
the document labels, but this is only to simulate how humans provide
prior knowledge to the model. The second way is to use LDA (Blei et
al., 2003) to select features. We use the same selection process as
(Druck et al., 2008), where they first train an LDA model on the
dataset, and then select the most probable features of each topic
(sorted by $P(w_i|t_j)$, the probability of word $w_i$ given topic
$t_j$).

Similar to (Schapire et al., 2002; Druck et al., 2008), we estimate the
reference distribution of the labeled features using a heuristic
strategy. If there are $|C|$ classes in total, and $n$ classes are
associated with a feature $k$, the probability that feature $k$ is
related with any one of the $n$ classes is $\frac{0.9}{n}$ and with any
other class is $\frac{0.1}{|C|-n}$.²

Neutral features are the most frequent words after removing stop words,
and their reference distributions are uniform. We use the top 10
frequent words as neutral features in all experiments.

3.2 With Unbalanced Labeled Features


In this section, we evaluate our approach when there is unbalanced
knowledge on the categories to be classified. The labeled features are
obtained through information gain. Two settings are chosen:

(a) We randomly select $t \in [1, 20]$ features from the feature pool
for one class, and only one feature for the other. The original
balanced movie dataset is used (positive:negative=1:1).

(b) Similar to (a), but the dataset is unbalanced, obtained by randomly
removing 75% of the positive documents (positive:negative=1:4).
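The construction of the unbalanced dataset in setting (b) can be
sketched as below: starting from a balanced corpus, dropping 75% of the
positive documents leaves one positive for every four negatives. The
function name, the label strings, and the fixed seed are assumptions
made for illustration only.

```python
import random

def remove_positive_fraction(labels, fraction, positive="positive", seed=0):
    """Return the indices kept after randomly dropping a fraction of positive documents."""
    rng = random.Random(seed)                  # fixed seed so the subset is reproducible
    positives = [i for i, y in enumerate(labels) if y == positive]
    n_drop = round(len(positives) * fraction)
    dropped = set(rng.sample(positives, n_drop))
    return [i for i in range(len(labels)) if i not in dropped]
```

With 100 positive and 100 negative documents, `fraction=0.75` keeps 25
positives and all 100 negatives, which is the 1:4 ratio used above.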

As shown in Figure 1, the maximum entropy principle shows improvement
only in the balanced case. An obvious reason is that maximum entropy
only favors the uniform distribution.

Incorporating neutral features performs similarly to maximum entropy
since we assume that neutral words are uniformly distributed. Its
accuracy decreases slowly when the number of labeled features becomes
larger (t > 4) (Figure 1(a)), suggesting that the model gradually
becomes biased towards the class with more labeled features, just like
GE-FL.

Incorporating the KL divergence of the class distribution performs much
better than GE-FL on both balanced and unbalanced datasets. This shows
that it is effective in controlling the imbalance in labeled features
and in the dataset.

3.3 With Balanced Labeled Features 

We also compare with the baseline when the labeled features are
balanced. Similar to the experiment above, the labeled features are
obtained by information gain. Two settings are experimented with:

(a) We randomly select $t \in [1, 20]$ features from the feature pool
for each class, and conduct comparisons on the original balanced movie
dataset (positive:negative=1:1).

(b) Similar to (a), but the class distribution is unbalanced, by
randomly removing 75% of the positive documents
(positive:negative=1:4).


¹ We also experimented on other datasets, and observed similar results.

² Previous work shows the model is insensitive to the setting of the
reference distribution.


Figure 1: Performance with unbalanced labeled features, tested on the
movie dataset. Panels: (a) balanced dataset; (b) unbalanced dataset
(1:4). Randomly select from the feature pool t (x-axis) labeled
features for one class, and select 1 feature for the other. The
unbalanced datasets in (b) are constructed by randomly removing 75% of
the positive documents.

Figure 2: Performance with balanced labeled features, tested on the
movie dataset. Panels: (a) balanced dataset; (b) unbalanced dataset
(1:4). Randomly select from the feature pool t (x-axis) labeled
features for each class. The unbalanced datasets in (b) are constructed
by randomly removing 75% of the positive documents.

Results are shown in Figure 2. When the dataset is balanced (Figure
2(a)), there is little difference between GE-FL and our methods. The
reason is that the proposed regularization terms provide no additional
knowledge to the model and there is no bias in the labeled features. On
the unbalanced dataset (Figure 2(b)), incorporating KL divergence is
much better than GE-FL since we provide additional knowledge (the true
class distribution), but maximum entropy and neutral features are much
worse because forcing the model to approach the uniform distribution
misleads it.

3.4 With Unbalanced Class Distributions 

Our methods are also evaluated on datasets with different unbalanced
class distributions. We manually construct several movie datasets with
class distributions of 1:2, 1:3, 1:4 by randomly removing 50%, 67%, 75%
of the positive documents. The original balanced movie dataset is used
as a control group. We test with both balanced and unbalanced labeled
features. For the balanced case, we randomly select 10 features from
the feature pool for each class, and for the unbalanced case, we select
10 features for one class, and 1 feature for the other. Results are
shown in Figure 3.


Figure 3: Influence of unbalanced class distribution, tested on the
movie dataset. Panels: (a) balanced labeled features; (b) unbalanced
labeled features; the x-axis is the class distribution. We provide 10
labeled features for each class in (a), and 10 for one class and 1 for
the other in (b). The three unbalanced datasets are constructed by
randomly removing 50%, 67%, 75% of the documents of the positive class.

Figure 4: The influence of $\lambda$, tested on the movie dataset.
Panels: (a) balanced dataset; (b) unbalanced dataset (1:4). Randomly
select from the feature pool t (x-axis) labeled features for one class,
and keep 1 feature for the other. The unbalanced datasets in (b) are
constructed by randomly removing 75% of the documents of the positive
class.

Figure 3(a) shows that when the dataset and the labeled features are
both balanced, there is little difference between our methods and GE-FL
(also see Figure 2(a)). But when the class distribution becomes more
unbalanced, the difference becomes more remarkable. The performance of
neutral features and maximum entropy decreases significantly, but that
of incorporating KL divergence increases remarkably. This suggests that
if we have more accurate knowledge about the class distribution, KL
divergence can guide the model in the right direction.

Figure 3(b) shows that when the labeled features are unbalanced, our
methods significantly outperform GE-FL. Incorporating KL divergence is
robust enough to control imbalance both in the dataset and in labeled
features, while the other three methods are not so competitive.

3.5 The Influence of $\lambda$

We present the influence of $\lambda$ on the method that incorporates
KL divergence in this section. Since we simply set
$\lambda = \beta |K|$, we just tune $\beta$ here. Note that when
$\beta = 0$, the newly introduced regularization term vanishes, and
thus the model is actually GE-FL. Again, we test the method with
different $\lambda$ in two settings:

(a) We randomly select $t \in [1, 20]$ features from the feature pool
for one class, and only one feature for the other class. The original
balanced movie dataset is used (positive:negative=1:1).

(b) Similar to (a), but the dataset is unbalanced, obtained by randomly
removing 75% of the positive documents (positive:negative=1:4).

Results are shown in Figure 4. As expected, $\lambda$ reflects how
strong the regularization is. The model tends to be closer to our
preferences as $\lambda$ increases in both cases.


3.6 Using LDA Selected Features

We compare our methods with GE-FL on all the 9 datasets in this
section. Instead of using features obtained by information gain, we use
LDA to select labeled features. Unlike information gain, LDA does not
employ any instance labels to find labeled features. In this setting,
we can build classification models without any instance annotation, but
just with labeled features.

Table 1 shows that our three methods outperform GE-FL on most of the
datasets. Incorporating neutral features performs better than GE-FL on
7 of the 9 datasets, maximum entropy is better on 8 datasets, and KL
divergence is better on 7 datasets.

Dataset                GE-FL    Neutral Features    Max Entropy    KL Divergence
movie                  0.623    0.672*              0.681*         0.684*
sraa                   0.559    0.628*              0.618*         0.547
webkb                  0.615    0.685*              0.640*         0.646*
med-space              0.927    0.921               0.936*         0.928*
ibm-mac                0.817    0.796               0.837*         0.833*
baseball-hockey        0.915    0.935*              0.925*         0.923*
20 newsgroups          0.667    0.669*              0.680*         0.677*
financial-healthcare   0.588    0.618*              0.618*         0.507
sector.top             0.596    0.653*              0.581          0.639*

Table 1: Performance using LDA-features. An asterisk (*) means the
method performs better than GE-FL.

LDA selects the most predictive features as labeled features without
considering the balance among classes. GE-FL does not exert any control
on such an issue, so its performance suffers severely. Our methods
introduce auxiliary regularization terms to control such a bias problem
and thus improve the model significantly.


4 Related Work 


There has been much work that incorporates prior knowledge into
learning, and two related lines are surveyed here. One is to use prior
knowledge to label unlabeled instances and then apply a standard
learning algorithm. The other is to constrain the model directly with
prior knowledge.

Liu et al. (2004) manually labeled features which are highly predictive
of unsupervised clustering assignments and used them to label unlabeled
data. Chang et al. (2007) proposed constraint driven learning. They
first used constraints and the learned model to annotate unlabeled
instances, and then updated the model with the newly labeled data.
Daumé (2008) proposed a self training method in which several models
are trained on the same dataset, and only unlabeled instances that
satisfy the cross task knowledge constraints are used in the self
training process.

McCallum et al. (2007) proposed generalized expectation (GE) criteria,
which formalise the knowledge as constraint terms about the expectation
of the model in the objective function. Graça et al. (2007) proposed
the posterior regularization (PR) framework, which projects the model's
posterior onto a set of distributions that satisfy the auxiliary
constraints. Druck et al. (2008) explored constraints of labeled
features in the framework of GE by forcing the model's predicted
feature distribution to approach the reference distribution.
Andrzejewski et al. (2011) proposed a framework in which general domain
knowledge can be easily incorporated into LDA. Altendorf et al. (2012)
explored monotonicity constraints to improve the accuracy while
learning from sparse data. Chen et al. (2013) tried to learn
comprehensible topic models by leveraging multi-domain knowledge.

Mann and McCallum (2007; 2010) incorporated not only labeled features
but also other knowledge like class distribution into the objective
function of GE-FL. But they discussed this only from the
semi-supervised perspective and did not investigate the robustness
problem, which we address in this paper.

There are also some active learning methods that try to use prior
knowledge. Raghavan et al. (2006) proposed to interleave feedback on
instances and features, and demonstrated that feedback on features
boosts the model considerably. Druck et al. (2009) proposed an active
learning method which solicits labels on features rather than on
instances and then used GE-FL to train the model.


5 Conclusion and Discussions 

This paper investigates the problem of how to leverage prior knowledge
robustly in learning models. We propose three regularization terms on
top of generalized expectation criteria. As demonstrated by the
experimental results, the performance can be considerably improved when
taking into account these factors. Comparative results show that our
proposed methods are more effective and more robust than the baselines.
To the best of our knowledge, this is the first work to address the
robustness problem of leveraging knowledge, and it may inspire other
research.

We then present more detailed discussions about the three
regularization methods. Incorporating neutral features is the simplest
way of regularization, which doesn't require any modification of GE-FL
but just finding some common features. But as Figure 1(a) shows, using
neutral features alone is not strong enough to handle extremely
unbalanced labeled features.

The maximum entropy regularization term shows a strong ability to
control imbalance. This method doesn't need any extra knowledge, and is
thus suitable when we know nothing about the corpus. But this method
assumes that the categories are uniformly distributed, which may not be
the case in practice, and its performance degrades if the assumption is
violated (see Figure 1(b), Figure 2(b), Figure 3(a)).

The KL divergence performs much better on unbalanced corpora than the
other methods. The reason is that KL divergence utilizes the reference
class distribution and doesn't make any assumptions. This suggests that
additional knowledge does benefit the model.

However, the KL divergence term requires providing the true class
distribution. Sometimes we may have exact knowledge of the true
distribution, but sometimes we may not. Fortunately, the model is
insensitive to the exact reference distribution, and therefore a rough
estimate of the true distribution is sufficient. In our experiments,
when the true class distribution is 1:2 and the reference class
distribution is set to 1:1.5/1:2/1:2.5, the accuracy is
0.755/0.756/0.760 respectively. This makes it possible to perform
simple computation on the corpus to obtain the distribution in
practice. Or, we can set the distribution roughly with domain
expertise.
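To make "rough estimate" concrete, the reference ratios quoted above
can be converted into distributions and compared with the true 1:2
distribution. This is only a sketch: the helper names are illustrative,
and the KL value is used here merely as a gauge of closeness, not as a
quantity reported in the experiments.

```python
import math

def ratio_to_distribution(ratio):
    """Normalize a class ratio such as [1, 2] into a probability distribution."""
    total = float(sum(ratio))
    return [r / total for r in ratio]

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = ratio_to_distribution([1, 2])
for ratio in ([1, 1.5], [1, 2], [1, 2.5]):
    ref = ratio_to_distribution(ratio)
    # Print each reference distribution and its divergence from the true one.
    print(ratio, [round(v, 3) for v in ref], round(kl(ref, true_dist), 4))
```

The two rough references stay close to the true distribution, which is
in line with the small differences in accuracy reported above.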